The purpose of this project is to develop an in-depth, fully Bayesian analysis to model wine quality based on physicochemical tests such as:
The output variable is the wine quality, good or bad, based on these physicochemical properties. The dataset is freely available on the UCI Machine Learning Repository, one of the largest collections of machine learning datasets, maintained by the Center for Machine Learning and Intelligent Systems at the University of California.
First, let's divide the dataset into train and test sets with proportions of \(80\%\) and \(20\%\) respectively (so that we can evaluate the performance of our model at the end), and let's see how the features are distributed, along with some other characteristics of the dataset.
To avoid displaying all the features, I report here only the best-known ones; for the others we will see only a brief summary.
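The split itself is straightforward; as an illustrative sketch (in Python rather than R, with a hypothetical `train_test_split` helper), it amounts to shuffling the row indices and cutting them 80/20:

```python
import random

def train_test_split(rows, test_frac=0.20, seed=42):
    """Shuffle the row indices and split them 80/20 (illustrative sketch)."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(rows) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# e.g. 1599 wines -> 1279 train / 320 test, as in our dataset
wines = list(range(1599))
train, test = train_test_split(wines)
```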
Fixed acidity is an important feature of wines. Acids are fundamental for wine conservation; for example, a long-lived wine suitable for aging must have a relatively high fixed acidity. In particular, fixed acids give a feeling of freshness in the mouth, which must be balanced by the sensations of warmth and softness induced by alcohol and sugars. Based on acidity, wine can be:
The pH of wine is a measure of the strength and concentration of the dissociated acids present in that medium. It is calculated from the concentration of hydrogen ions via \(pH = -\log_{10}[H^+]\) and can be adjusted through the addition of acid or base. It plays a critical role in many aspects of winemaking, in particular wine stability and red wine colour. For wine it usually lies in the range \([2.5, 4]\).
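As a quick sanity check of the formula, here is a minimal Python sketch (the helper name `pH` is ours):

```python
import math

def pH(h_conc):
    """pH = -log10 of the hydrogen-ion concentration [H+] in mol/L."""
    return -math.log10(h_conc)

# a wine with [H+] = 10^-3.31 mol/L has pH 3.31, the median value in our data
```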
Alcohol content in wine varies between 4% (in some sweet sparkling wines) and 20% or more (in liqueur wines). Wines with an alcohol content of up to 10% are generally described as “light,” whereas wines with increasing alcohol content are often described as more or less “hot.” In our dataset most of the wines are “light” (alcohol \(\leq 10\%\)) or “quite hot” (alcohol between \(11\%\) and \(12\%\)); only a few wines are “strongly alcoholic” (alcohol \(\geq 15\%\)).
Sulfites are a food preservative widely used in winemaking thanks to their ability to maintain the flavor and freshness of wine. Thanks to their antimicrobial properties, these compounds prevent bacterial growth and prolong the shelf life of wines and other products. Moreover, they improve taste and appearance and enhance flavor. Fewer sulfites are generally better, but a typical red wine contains around \(50\text{–}100\ mg/L\) of them.
Here I report the main statistics of all the features.
| | Min | Max | Median | Mean | Standard Dev |
|---|---|---|---|---|---|
| Fixed acidity | 4.7 | 15.9 | 7.9 | 8.339 | 1.73 |
| Volatile acidity | 0.12 | 1.58 | 0.52 | 0.528 | 0.181 |
| Citric acid | 0 | 1 | 0.26 | 0.271 | 0.194 |
| Residual sugar | 0.9 | 15.4 | 2.2 | 2.531 | 1.381 |
| Chlorides | 0.012 | 0.611 | 0.079 | 0.088 | 0.048 |
| Free sulfur dioxide | 1 | 72 | 14 | 16.021 | 10.538 |
| Total sulfur dioxide | 6 | 289 | 38 | 46.711 | 32.993 |
| Density | 0.99007 | 1.00369 | 0.99676 | 0.997 | 0.002 |
| pH | 2.74 | 4.01 | 3.31 | 3.311 | 0.156 |
| Sulphates | 0.33 | 2 | 0.62 | 0.657 | 0.166 |
| Alcohol | 8.4 | 14.9 | 10.1 | 10.422 | 1.07 |
In total, 1599 wines have been tested, which implies that the number of observations in the training set is 1279. As we can see, the training set is almost perfectly balanced, so we don't need any oversampling method to increase the data of the minority class.
At first glance we look for a feature capable of perfectly separating the two classes on its own. Unfortunately there isn't a feature for which wines belonging to different classes take clearly different values: the values are evenly distributed between the classes, and only for alcohol and total sulfur dioxide do we see a slight separation.
Bad wines seem to have a lower alcohol content, and wines with a high alcohol content are more likely to be good wines.
Free sulfur dioxide helps protect wine from oxidation and spoilage, but as we can see, wines with a high amount of this molecule are likely to be classified as bad. In fact, an excessive quantity of \(FSO_2\) can be perceptible to consumers, masking the wine's own fruit aromas and contributing a sharp, bitter, metallic, chemical flavor or sensation.
Let's compute the Pearson correlation coefficient between all pairs of features and see whether any features are strongly correlated.
As we could expect, some features are strongly correlated, such as fixed acidity and pH: the greater the presence of acids, the lower the pH (recall the pH formula and its logarithmic scale). The same holds for density and alcohol content: the greater the quantity of ethanol, the lower the density of the wine.
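For reference, the coefficient for a single pair of features can be computed as follows (a minimal Python sketch rather than R's `cor`):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```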
In our set-up the response variable, wine quality, can be modeled as a Bernoulli random variable: \[ Y_i \in \{0,1\}, \quad \text{where } 0 \text{ means bad and } 1 \text{ means good}. \] The Bernoulli distribution belongs to the exponential family, which is why we can use a Generalized Linear Model to analyze the response variable. The only things we need to define now are:
the so-called systematic component, or linear predictor, which is a function of the explanatory variables as in the normal regression model: \(\eta_i=X_i\beta=\beta_0+\sum_{j=1}^p x_{ij}\beta_j\);
the link function \(g(\theta)\), the mathematical expression that connects the value of the linear predictor with the response \(Y\). It has to map the range of values in which the parameter of interest lies to the set of real numbers \(\mathbb{R}\) in which the linear predictor takes values. For binary data the canonical link is the logit function \(g(\pi)=\log\left(\frac{\pi}{1-\pi}\right)\); alternatives are the probit link \(g(\pi)=\Phi^{-1}(\pi)\), the inverse of the Normal CDF, and the complementary log-log link \(g(\pi)=\log\{-\log(1-\pi)\}\), which is less popular but models the tails of the distribution more effectively, especially when there is asymmetry between low and high probability values.
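To make the links concrete, here is a small Python sketch of the logit, its inverse, and the complementary log-log link (the probit is omitted since the Normal quantile function is not in the standard library):

```python
import math

def logit(p):
    """Canonical link for binary data: maps (0, 1) to the real line."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse logit (sigmoid): maps the linear predictor back to (0, 1)."""
    return 1 / (1 + math.exp(-eta))

def cloglog(p):
    """Complementary log-log link."""
    return math.log(-math.log(1 - p))
```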
Using all the explanatory variables we have, the Generalized Linear Model is:
\[ Y_i\sim \mathrm{Bernoulli}(p_i)\\ \mathrm{logit}(p_i)=\log\left(\frac{p_i}{1-p_i}\right)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\beta_3 x_{i3}+\beta_4 x_{i4}+\beta_5 x_{i5}+\beta_6 x_{i6}+\beta_7 x_{i7}+\beta_8 x_{i8}+\beta_9 x_{i9}+\beta_{10} x_{i10}+\beta_{11} x_{i11}\\ \beta_i\sim \mathrm{Normal}\left(\mu=0,\ \tau=10^{-6}\right)\qquad \text{for } i \in\{0,1,\dots,11\} \] where, following the JAGS parameterization, \(\tau\) is the precision (inverse variance), so these are vague priors with variance \(10^6\). Model coefficients \(\beta_i\) are typically given Normal priors because they can take any real value and the support of the Normal is exactly \((-\infty,+\infty)\).
model <- function(){
# Likelihood
for (i in 1:N){
y[i] ~ dbern(p[i])
logit(p[i]) <- beta0 + beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i] + beta5*x5[i] + beta6*x6[i] + beta7*x7[i] + beta8*x8[i] + beta9*x9[i] + beta10*x10[i] + beta11*x11[i]
}
# Defining the prior beta parameters
beta0 ~ dnorm(0, 1.0E-6)
beta1 ~ dnorm(0, 1.0E-6)
beta2 ~ dnorm(0, 1.0E-6)
beta3 ~ dnorm(0, 1.0E-6)
beta4 ~ dnorm(0, 1.0E-6)
beta5 ~ dnorm(0, 1.0E-6)
beta6 ~ dnorm(0, 1.0E-6)
beta7 ~ dnorm(0, 1.0E-6)
beta8 ~ dnorm(0, 1.0E-6)
beta9 ~ dnorm(0, 1.0E-6)
beta10 ~ dnorm(0, 1.0E-6)
beta11 ~ dnorm(0, 1.0E-6)
}
Running JAGS with 3 chains, 10000 iterations, a burn-in of 1000 steps and a thinning of 10, we obtain:
| | mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff |
|---|---|---|---|---|---|---|---|---|---|
| beta0 | 50.0141 | 88.1723 | -121.2143 | -10.2418 | 49.7109 | 108.9489 | 222.7448 | 1.0019 | 2700 |
| beta1 | 0.1539 | 0.1113 | -0.0688 | 0.0786 | 0.1534 | 0.2313 | 0.3726 | 1.0012 | 2700 |
| beta10 | 3.2509 | 0.5371 | 2.2244 | 2.8857 | 3.2430 | 3.6082 | 4.3148 | 1.0006 | 2700 |
| beta11 | 0.8822 | 0.1155 | 0.6629 | 0.8034 | 0.8817 | 0.9607 | 1.1073 | 1.0019 | 1400 |
| beta2 | -3.6282 | 0.5410 | -4.7027 | -3.9973 | -3.6057 | -3.2665 | -2.5934 | 1.0005 | 2700 |
| beta3 | -1.6651 | 0.6367 | -2.8967 | -2.0901 | -1.6795 | -1.2334 | -0.4021 | 1.0009 | 2700 |
| beta4 | 0.0934 | 0.0623 | -0.0326 | 0.0528 | 0.0940 | 0.1344 | 0.2186 | 1.0019 | 1400 |
| beta5 | -4.9134 | 1.8238 | -8.5378 | -6.1146 | -4.9182 | -3.6971 | -1.3180 | 1.0006 | 2700 |
| beta6 | 0.0263 | 0.0093 | 0.0082 | 0.0200 | 0.0264 | 0.0325 | 0.0443 | 1.0023 | 1100 |
| beta7 | -0.0178 | 0.0033 | -0.0243 | -0.0200 | -0.0178 | -0.0156 | -0.0114 | 1.0014 | 2100 |
| beta8 | -58.4842 | 89.9910 | -233.5058 | -119.0697 | -58.0267 | 2.8774 | 116.1996 | 1.0019 | 2700 |
| beta9 | -0.3254 | 0.7955 | -1.9092 | -0.8483 | -0.3179 | 0.2116 | 1.1950 | 1.0009 | 2700 |
| deviance | 1309.8859 | 5.0944 | 1302.2275 | 1306.3665 | 1309.2979 | 1312.7002 | 1320.7248 | 1.0027 | 880 |
The DIC of the model is 1322.84.
The point estimates for the \(\beta_i\) are shown in the “mean” column. The \(95\%\) credible intervals for \(\beta_0,\beta_1,\beta_4,\beta_8,\beta_9\) contain 0, which means that the corresponding variables (the intercept, \(x_1\) fixed acidity, \(x_4\) residual sugar, \(x_8\) density, \(x_9\) pH) may not be statistically significant; we can try to remove them and see whether the DIC decreases. The other diagnostics also look good:
Rhat (potential scale reduction factor) is a measure of the convergence of the chains: it indicates whether all the chains converge to the same value. It compares the between-chain variance with the within-chain variance; if they are similar, the chains come from the same distribution. The value of Rhat should lie between 1 and 1.1.
n.eff indicates the effective sample size, which can be interpreted as the number of independent Monte Carlo samples needed to give the same precision as the MCMC samples. The greater this value, the lower the autocorrelation between the MCMC steps and the better the final approximation. Let us recall why: the quality of the approximation depends on the variance of the MCMC estimator, which equals the plain Monte Carlo variance plus a term that depends on the correlation of the samples within the Markov chain. Generally this term is positive, so we expect a worse approximation than the independent Monte Carlo one. The formula is: \[ Var_{MCMC}[\hat{I}]=\frac{\sigma^2}{S_{eff}}=\left(1+2\sum_{k=1}^{\infty}\rho_k\right)\frac{\sigma^2}{t} \] where \(\hat{I}\) is the estimate of the quantity we are interested in and \(S_{eff}=t/\left(1+2\sum_{k}\rho_k\right)\).
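The relationship between the effective sample size, the autocorrelations and the MCMC variance can be sketched in a few lines of Python (function names are ours):

```python
def effective_sample_size(t, rhos):
    """S_eff = t / (1 + 2 * sum of lag-k autocorrelations rho_k)."""
    return t / (1 + 2 * sum(rhos))

def mcmc_variance(sigma2, t, rhos):
    """Var_MCMC = (1 + 2 * sum rho_k) * sigma^2 / t, as in the formula above."""
    return (1 + 2 * sum(rhos)) * sigma2 / t
```

With zero autocorrelation the effective sample size equals the chain length and we recover the plain Monte Carlo variance \(\sigma^2/t\).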
The DIC (deviance information criterion) is particularly useful in Bayesian model selection because it measures the goodness of fit of the model. There isn't a standard absolute scale for DIC; it can take any positive value and equals \(DIC=D_{\hat{\theta}}(y)+2p_D\), where \(D_{\hat{\theta}}(y)\) is the deviance of the model (\(-2\log(\text{likelihood})\)) evaluated at the posterior mean and \(p_D\) is the effective number of (“unconstrained”) parameters in the model. Since \(D_{\hat{\theta}}(y)\) decreases as the number of parameters in a model increases, the \(p_D\) term compensates for this effect by favoring models with fewer parameters. Models with smaller DIC should be preferred to models with larger DIC.
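As a worked illustration of the formula (in Python, with hypothetical inputs; \(p_D\) is estimated as the mean posterior deviance minus the deviance at the posterior mean):

```python
def dic(mean_deviance, deviance_at_mean):
    """DIC = D(theta_hat) + 2 * p_D, with p_D = mean deviance - deviance at mean."""
    p_d = mean_deviance - deviance_at_mean
    return deviance_at_mean + 2 * p_d
```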
For comparison, here is the summary of the frequentist fit of the same logistic regression:
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | 49.5255 | 90.4013 | 0.5478 | 0.5838 |
| x1 | 0.1528 | 0.1118 | 1.3670 | 0.1716 |
| x2 | -3.5947 | 0.5491 | -6.5464 | 0.0000 |
| x3 | -1.6427 | 0.6314 | -2.6015 | 0.0093 |
| x4 | 0.0933 | 0.0617 | 1.5105 | 0.1309 |
| x5 | -4.7436 | 1.7955 | -2.6419 | 0.0082 |
| x6 | 0.0259 | 0.0093 | 2.7960 | 0.0052 |
| x7 | -0.0176 | 0.0033 | -5.3914 | 0.0000 |
| x8 | -57.9923 | 92.2671 | -0.6285 | 0.5297 |
| x9 | -0.3050 | 0.8017 | -0.3804 | 0.7036 |
| x10 | 3.1833 | 0.5360 | 5.9389 | 0.0000 |
| x11 | 0.8763 | 0.1189 | 7.3721 | 0.0000 |
As we can see from this summary, the estimated coefficients obtained by the frequentist approach are very similar to the ones obtained with JAGS. The AIC score is also similar: 1321.84 (we compare AIC and DIC because, in models with negligible prior information, DIC is approximately equivalent to AIC). These results indicate that the intercept, x1 (fixed acidity), x4 (residual sugar), x8 (density) and x9 (pH) may not be informative for the model.
Let's try to remove them and see the results.
As said above, we can try to eliminate some features from the model and see what happens. After some trials with different combinations of the features that initially seem the least informative (intercept, x1, x4, x8, x9), I found that the features that really compromise the convergence tests we will see later are the intercept, x4 and x9.
model_filtered <- function(){
# Likelihood
for (i in 1:N){
y[i] ~ dbern(p[i])
logit(p[i]) <-beta1*x1[i] + beta2*x2[i] + beta3*x3[i] +beta5*x5[i] + beta6*x6[i] + beta7*x7[i]+beta8*x8[i]+beta10*x10[i]+beta11*x11[i]
}
# Defining the prior beta parameters
beta1 ~ dnorm(0, 1.0E-6)
beta2 ~ dnorm(0, 1.0E-6)
beta3 ~ dnorm(0, 1.0E-6)
beta5 ~ dnorm(0, 1.0E-6)
beta6 ~ dnorm(0, 1.0E-6)
beta7 ~ dnorm(0, 1.0E-6)
beta8 ~ dnorm(0, 1.0E-6)
beta10 ~ dnorm(0, 1.0E-6)
beta11 ~ dnorm(0, 1.0E-6)
}
| | mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff |
|---|---|---|---|---|---|---|---|---|---|
| beta1 | 0.1440 | 0.0577 | 0.0350 | 0.1040 | 0.1436 | 0.1822 | 0.2576 | 1.0006 | 2700 |
| beta10 | 3.0581 | 0.5199 | 2.0241 | 2.7100 | 3.0569 | 3.4058 | 4.0766 | 1.0008 | 2700 |
| beta11 | 0.9270 | 0.0844 | 0.7646 | 0.8680 | 0.9281 | 0.9832 | 1.0902 | 1.0007 | 2700 |
| beta2 | -3.6798 | 0.5422 | -4.7240 | -4.0426 | -3.6825 | -3.3305 | -2.6397 | 1.0006 | 2700 |
| beta3 | -1.5937 | 0.6290 | -2.8560 | -2.0111 | -1.5898 | -1.1760 | -0.3679 | 1.0009 | 2700 |
| beta5 | -4.3940 | 1.7551 | -7.9050 | -5.5461 | -4.3758 | -3.2014 | -1.0099 | 1.0006 | 2700 |
| beta6 | 0.0272 | 0.0093 | 0.0086 | 0.0208 | 0.0274 | 0.0338 | 0.0454 | 1.0006 | 2700 |
| beta7 | -0.0168 | 0.0032 | -0.0232 | -0.0190 | -0.0168 | -0.0147 | -0.0108 | 1.0007 | 2700 |
| beta8 | -9.5118 | 1.0990 | -11.6672 | -10.2491 | -9.5095 | -8.7763 | -7.4051 | 1.0010 | 2700 |
| deviance | 1310.1463 | 4.3922 | 1303.7774 | 1307.0177 | 1309.4741 | 1312.4607 | 1320.8828 | 1.0024 | 1500 |
Moreover, removing them the DIC slightly decreases, to 1319.79.
Let’s see the plots of the Markov chain, the density distribution of the values along the chain and the autocorrelation between these values.
As we can see, all the Markov chains remain in the steady state, and the autocorrelation between consecutive values of each chain is very small. A Markov chain with high autocorrelation moves around the parameter space slowly, taking a long time to achieve the correct balance among the different regions of the parameter space. An independent MC sampler has zero autocorrelation and can jump between different regions of the parameter space in one step. Moreover, the higher the autocorrelation in the chain, the larger the MCMC variance and the worse the approximation of the quantity of interest.
The Geweke diagnostic takes two nonoverlapping parts of the Markov chain (usually the first 0.1 proportion, after the burn-in, and the last 0.5 proportion) and compares their means, using a difference-of-means test of the null hypothesis that the two parts of the chain come from the same distribution. If the two means are equal, the chain has probably converged.
The test statistic is a standard Z-score: the difference between the two sample means divided by its estimated standard error, adjusted for autocorrelation.
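A naive version of this statistic can be sketched in Python; note that, unlike the real diagnostic, this sketch uses plain sample variances instead of spectral-density estimates, so it ignores the autocorrelation adjustment:

```python
import math

def geweke_z(chain, first=0.1, last=0.5):
    """Naive Geweke z-score: compare the means of the first 10% and the
    last 50% of the chain, using plain sample variances for the standard
    error (the real diagnostic adjusts for autocorrelation)."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1 - last) * n):]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
```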
| | beta1 | beta10 | beta11 | beta2 | beta3 | beta5 | beta6 | beta7 | beta8 | deviance |
|---|---|---|---|---|---|---|---|---|---|---|
| Z-score chain 1 | -1.8075 | -0.3422 | -0.8420 | -0.4038 | 0.1721 | 1.0140 | 1.5755 | -2.6680 | 1.7418 | 0.1702 |
| Z-score chain 2 | 1.5878 | 0.8119 | -0.1599 | 1.2546 | -1.0915 | 0.2955 | -0.6954 | 2.8194 | -1.4308 | 0.4184 |
| Z-score chain 3 | -1.3249 | 1.4023 | 0.5777 | 0.5500 | 1.6183 | -0.5751 | 0.1580 | -0.9854 | -0.5538 | 0.5673 |
As we can see from the plots of the Z-scores of the first chain, the majority of the values lie inside the interval \((-1.96,+1.96)\), so we almost always accept the equality of the means.
The Heidelberg and Welch diagnostic calculates a test statistic to accept or reject the null hypothesis that the Markov chain is from a stationary distribution.
The diagnostic consists of two parts: a stationarity test and a halfwidth test.
If the chain passes the first part of the diagnostic, the part of the chain not discarded in that step is used for the second part. The halfwidth test calculates half the width of the \((1-\alpha)\%\) credible interval around the mean. If the ratio of the halfwidth to the mean is lower than some \(\epsilon\), the chain passes the test; otherwise, the chain must be run longer.
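The halfwidth test itself reduces to a one-line check; a Python sketch (with illustrative default \(z\) and \(\epsilon\) values):

```python
def halfwidth_test(mean, se, z=1.96, eps=0.1):
    """Heidelberger-Welch halfwidth test: pass if half the width of the
    95% interval around the mean is small relative to the mean itself.
    Returns (passed, halfwidth)."""
    halfwidth = z * se
    return abs(halfwidth / mean) < eps, halfwidth
```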
| | Stationarity test | start iteration | p-value | Halfwidth test | Mean | Halfwidth |
|---|---|---|---|---|---|---|
| beta1 | passed | 91 | 0.0991 | passed | 0.1457 | 0.0045 |
| beta10 | passed | 1 | 0.0998 | passed | 3.0701 | 0.0336 |
| beta11 | passed | 1 | 0.2440 | passed | 0.9286 | 0.0056 |
| beta2 | passed | 1 | 0.3020 | passed | -3.6742 | 0.0355 |
| beta3 | passed | 1 | 0.4258 | passed | -1.6078 | 0.0413 |
| beta5 | passed | 1 | 0.3628 | passed | -4.4237 | 0.1188 |
| beta6 | passed | 1 | 0.0820 | passed | 0.0271 | 0.0007 |
| beta7 | passed | 91 | 0.3770 | passed | -0.0167 | 0.0003 |
| beta8 | passed | 1 | 0.3635 | passed | -9.5415 | 0.0723 |
| deviance | passed | 1 | 0.5764 | passed | 1310.3603 | 0.3029 |
All the tests are passed, so we can say that the Markov chain has reached its stationary distribution and that the accuracy of our estimates is good enough (the intervals are small enough).
The Gelman–Rubin diagnostic computes the potential scale reduction factor through the following steps (for each parameter):
\[ W=\frac{1}{m}\sum_{j=1}^m s_j^2 \] where \[ s_j^2=\frac{1}{n-1}\sum_{i=1}^n(\theta_{ij}-\bar{\theta}_j)^2 \] Here \(s_j^2\) is just the variance of the \(j\)-th chain, so \(W\) is the mean of the within-chain variances.
\[ B=\frac{n}{m-1}\sum_{j=1}^m(\bar{\theta_j}-\bar{\bar{\theta}})^2 \] where \[ \bar{\bar{\theta}}=\frac{1}{m}\sum_{j=1}^{m}\bar{\theta_j} \]
This is the variance of the chain means multiplied by \(n\), because each chain is based on \(n\) draws.
We can then estimate the variance of the stationary distribution as a weighted average of W and B.
\[ \hat{Var(\theta)}=\left(1-\frac{1}{n}\right)W+\frac{1}{n}B \]
Because of the overdispersion of the starting values, this overestimates the true variance, but it is unbiased if the starting distribution equals the stationary distribution (i.e., if the starting values were not overdispersed).
The potential scale reduction factor is \[ \hat{R}=\sqrt{\frac{\hat{Var(\theta)}}{W}} \]
When \(\hat{R}\) is high (perhaps greater than 1.1 or 1.2), then we should run our chains out longer to improve convergence to the stationary distribution.
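The steps above can be sketched directly in Python (a minimal illustration, not the exact `gelman.diag` implementation, which also includes a degrees-of-freedom correction):

```python
import math

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat from m chains of n draws each,
    following the W / B formulas above."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m                       # overall mean theta-bar-bar
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m # mean within-chain variance
    B = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    var_hat = (1 - 1 / n) * W + B / n            # weighted variance estimate
    return math.sqrt(var_hat / W)
```

Identical chains give \(B=0\) and \(\hat{R}<1\), while chains stuck in different regions give a large \(\hat{R}\).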
| | Point est. | Upper C.I. |
|---|---|---|
| beta1 | 1.0020796 | 1.009859 |
| beta10 | 0.9996509 | 1.001040 |
| beta11 | 1.0000762 | 1.000248 |
| beta2 | 1.0003867 | 1.002551 |
| beta3 | 1.0009550 | 1.006106 |
| beta5 | 1.0013559 | 1.005364 |
| beta6 | 1.0027693 | 1.010340 |
| beta7 | 1.0026834 | 1.011983 |
| beta8 | 1.0003802 | 1.003374 |
| deviance | 1.0055060 | 1.015004 |
All the values are near \(1\), which indicates that 10000 iterations are enough for convergence to the stationary distribution.
To check that the MCMC implementation is correct, we can run it on simulated data and verify whether it is able to recover model parameters we have fixed. The steps are the following:
- Fix the sample size; in this case I chose the same number of observations we have initially.
- Generate data from the ones we have: here we can compute the ECDF and then sample from it (sampling from the empirical distribution function is the same as resampling with replacement from the sample). This is just bootstrapping.
- Fix the model parameters; here I have chosen the ones obtained by the previous model.
- Compute the linear predictor and the sigmoid value (inverse logit function).
- Generate the output as a \(Bernoulli\) r.v. with probability equal to the value of the sigmoid.
- Run JAGS and see if it is able to recover the parameters that generated the data.
#sample size
N<-nrow(dataset)
#simulated data
x1_simulation<-sample(x1,N,replace=TRUE)
x2_simulation<-sample(x2,N,replace=TRUE)
x3_simulation<-sample(x3,N,replace=TRUE)
x5_simulation<-sample(x5,N,replace=TRUE)
x6_simulation<-sample(x6,N,replace=TRUE)
x7_simulation<-sample(x7,N,replace=TRUE)
x8_simulation<-sample(x8,N,replace=TRUE)
x10_simulation<-sample(x10,N,replace=TRUE)
x11_simulation<-sample(x11,N,replace=TRUE)
#fixed model parameters
beta_1_sim<-mod.fit_filtered$BUGSoutput$summary[1,1]
beta_2_sim<-mod.fit_filtered$BUGSoutput$summary[4,1]
beta_3_sim<-mod.fit_filtered$BUGSoutput$summary[5,1]
beta_5_sim<-mod.fit_filtered$BUGSoutput$summary[6,1]
beta_6_sim<-mod.fit_filtered$BUGSoutput$summary[7,1]
beta_7_sim<-mod.fit_filtered$BUGSoutput$summary[8,1]
beta_8_sim<-mod.fit_filtered$BUGSoutput$summary[9,1]
beta_10_sim<-mod.fit_filtered$BUGSoutput$summary[2,1]
beta_11_sim<-mod.fit_filtered$BUGSoutput$summary[3,1]
#linear predictor
linpred <- beta_1_sim*x1_simulation+beta_2_sim*x2_simulation + beta_3_sim*x3_simulation +beta_5_sim*x5_simulation + beta_6_sim*x6_simulation + beta_7_sim*x7_simulation+ beta_8_sim*x8_simulation+beta_10_sim*x10_simulation+beta_11_sim*x11_simulation
# probability for bernoulli
pis <- exp(linpred)/(1+exp(linpred))
#simulated output
y_simulation <- rbinom(nrow(dataset), 1, pis)
model_filtered_simulation<- function(){
# Likelihood
for (i in 1:N){
y_simulation[i] ~ dbern(p[i])
logit(p[i]) <- beta1*x1_simulation[i] + beta2*x2_simulation[i] + beta3*x3_simulation[i] + beta5*x5_simulation[i] + beta6*x6_simulation[i] + beta7*x7_simulation[i] + beta8*x8_simulation[i] + beta10*x10_simulation[i] + beta11*x11_simulation[i]
}
# Defining the prior beta parameters
beta1 ~ dnorm(0, 1.0E-6)
beta2 ~ dnorm(0, 1.0E-6)
beta3 ~ dnorm(0, 1.0E-6)
beta5 ~ dnorm(0, 1.0E-6)
beta6 ~ dnorm(0, 1.0E-6)
beta7 ~ dnorm(0, 1.0E-6)
beta8 ~ dnorm(0, 1.0E-6)
beta10 ~ dnorm(0, 1.0E-6)
beta11 ~ dnorm(0, 1.0E-6)
}
| | mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff |
|---|---|---|---|---|---|---|---|---|---|
| beta1 | 0.1153 | 0.1681 | -0.2075 | 0.0021 | 0.1164 | 0.2261 | 0.4548 | 1.0017 | 1500 |
| beta10 | 3.2007 | 1.6388 | 0.1839 | 2.0332 | 3.1066 | 4.3070 | 6.5574 | 1.0106 | 200 |
| beta11 | 0.3957 | 0.2699 | -0.0991 | 0.2117 | 0.3817 | 0.5798 | 0.9473 | 1.0101 | 210 |
| beta2 | -3.4852 | 1.7046 | -6.9256 | -4.5792 | -3.4031 | -2.3056 | -0.3239 | 1.0078 | 270 |
| beta3 | -1.2256 | 1.1792 | -3.6053 | -1.9988 | -1.1916 | -0.3955 | 0.9608 | 1.0007 | 2400 |
| beta5 | -6.7605 | 7.9983 | -24.2601 | -11.7522 | -6.0978 | -1.1951 | 7.4149 | 1.0067 | 390 |
| beta6 | 0.3194 | 0.0821 | 0.1530 | 0.2594 | 0.3210 | 0.3877 | 0.4558 | 1.0223 | 100 |
| beta7 | -0.5004 | 0.1573 | -0.7414 | -0.6440 | -0.5105 | -0.3898 | -0.1730 | 1.0384 | 65 |
| beta8 | -5.7586 | 3.3962 | -12.4964 | -7.9482 | -5.6193 | -3.3947 | 0.4208 | 1.0071 | 360 |
| deviance | 100.2004 | 22.8993 | 84.1825 | 89.7244 | 94.8615 | 103.5644 | 147.8861 | 1.0099 | 220 |
As we can see, the values recovered by JAGS are broadly consistent with the ones we fixed, which suggests that our MCMC implementation is correct.
We have checked that all the chains passed the convergence tests, so we can pool them and make a more accurate estimation of the parameters of interest. The principal statistics I report are the posterior point estimates, the equal-tailed credible intervals and the highest posterior density (HPD) intervals.
| | point estimate | Quantile interval left | Quantile interval right | HPD interval left | HPD interval right |
|---|---|---|---|---|---|
| beta1 | 0.1440 | 0.0350 | 0.2576 | 0.0362 | 0.2580 |
| beta10 | 3.0581 | 2.0241 | 4.0766 | 2.0565 | 4.1054 |
| beta11 | 0.9270 | 0.7646 | 1.0902 | 0.7523 | 1.0771 |
| beta2 | -3.6798 | -4.7240 | -2.6397 | -4.7152 | -2.6308 |
| beta3 | -1.5937 | -2.8560 | -0.3679 | -2.8228 | -0.3437 |
| beta5 | -4.3940 | -7.9050 | -1.0099 | -7.8481 | -0.9458 |
| beta6 | 0.0272 | 0.0086 | 0.0454 | 0.0076 | 0.0437 |
| beta7 | -0.0168 | -0.0232 | -0.0108 | -0.0227 | -0.0104 |
| beta8 | -9.5118 | -11.6672 | -7.4051 | -11.6077 | -7.3635 |
| deviance | 1310.1463 | 1303.7774 | 1320.8828 | 1303.0794 | 1318.9505 |
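The two interval types can be computed from the pooled draws roughly as follows (a Python sketch using simple order statistics; the HPD search scans all windows containing \(1-\alpha\) of the sorted draws and keeps the shortest one):

```python
def quantile_interval(samples, alpha=0.05):
    """Equal-tailed credible interval from posterior draws."""
    s = sorted(samples)
    n = len(s)
    lo = s[int(alpha / 2 * n)]
    hi = s[int((1 - alpha / 2) * n) - 1]
    return lo, hi

def hpd_interval(samples, alpha=0.05):
    """Highest posterior density interval: the shortest interval
    containing a (1 - alpha) fraction of the draws."""
    s = sorted(samples)
    n = len(s)
    k = round((1 - alpha) * n)
    widths = [s[i + k - 1] - s[i] for i in range(n - k + 1)]
    i = widths.index(min(widths))
    return s[i], s[i + k - 1]
```

For a symmetric posterior the two intervals are nearly identical, as in our tables; for skewed posteriors the HPD interval is shorter.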
Now we have everything we need to evaluate the performance of our model. We have “trained” the model on \(80\%\) of the observations of the original dataset; let's predict the result for the test set and compare it with the real output.
x1_test<-dataset_test$fixed.acidity
x2_test<-dataset_test$volatile.acidity
x3_test<-dataset_test$citric.acid
x4_test<-dataset_test$residual.sugar
x5_test<-dataset_test$chlorides
x6_test<-dataset_test$free.sulfur.dioxide
x7_test<-dataset_test$total.sulfur.dioxide
x8_test<-dataset_test$density
x9_test<-dataset_test$pH
x10_test<-dataset_test$sulphates
x11_test<-dataset_test$alcohol
y_test<-convertion_vec(dataset_test$quality)
y_test<-as.vector(y_test)
linear_pred_test<-beta.hat.jags[1]*x1_test+beta.hat.jags[4]*x2_test+
beta.hat.jags[5]*x3_test+beta.hat.jags[6]*x5_test+
beta.hat.jags[7]*x6_test+beta.hat.jags[8]*x7_test+
beta.hat.jags[9]*x8_test+beta.hat.jags[2]*x10_test+
beta.hat.jags[3]*x11_test
prob_predict_test<-exp(linear_pred_test)/(1+exp(linear_pred_test))
y_pred<-rbinom(nrow(dataset_test),1,prob_predict_test)
accuracy<-mean(y_test==y_pred)
The accuracy reached is 0.65. This is a decent result for a classifier as simple as logistic regression; we could try to improve the performance, for example, with a more complex (perhaps nonlinear) model, with different combinations of the features, or with different priors for the parameters.
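As a side note, a common deterministic alternative to drawing \(y\) from a Bernoulli, as done above with `rbinom`, is to threshold the predicted probability at 0.5, which removes the extra randomness from the reported accuracy; a Python sketch:

```python
def predict(prob, threshold=0.5):
    """Deterministic class prediction from the posterior probability
    (instead of sampling y ~ Bernoulli(prob))."""
    return 1 if prob >= threshold else 0

def accuracy(y_true, y_pred):
    """Fraction of correctly classified observations."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```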